Advancing Data Science in Official Statistics
A bold new approach for the US Census Bureau to produce statistical products using survey, administrative, procedural, and opportunity data is introduced. Developing these capabilities requires numerous use cases defined by stakeholders. The Curated Data Enterprise framework is described in four sections; a use case is presented in Section 3 to demonstrate the feasibility of this new approach.
United States Census Bureau Agreement No. 01-21-MOU-06 and
Alfred P. Sloan Foundation Grant No. G-2022-19536
1 The Policy Problem
Sallie Keller, Stephanie Shipp, Vicki Lancaster, and Joseph Salvo
University of Virginia
(Sallie Keller was at the University of Virginia when this work was conducted. She is now the Chief Scientist and Associate Director of Research and Methodology at the U.S. Census Bureau. The views expressed in this perspective are those of the authors and not the Census Bureau.)
Two centuries ago, when the Framers of the US Constitution laid the cornerstone for the federal statistical system, they could not have imagined the complexity of questions future generations would want to ask or the variety of data sources available to address them. Back in 1787, counting the population and apportioning state seats in the House of Representatives were the most urgent tasks before the young nation, and so a requirement for a decennial census was written into the Constitution. Now, 233 years later, the census continues to serve its original purpose – but purposes and uses for census data have exploded.
Questions we now seek to answer go beyond what the census (or surveys) alone can hope to address. Even with the multitude of other surveys commissioned by today’s Census Bureau, researchers and policymakers find themselves looking to novel sources of data – from structured numeric data in traditional databases to unstructured text documents scraped from the internet – to explore issues such as understanding how prepared nursing homes and communities are for extreme climate events such as hurricanes, wildfires, or floods. Wrangling these sources with traditionally designed data, such as censuses and surveys, can fill data gaps, improve the quality and usefulness of statistical products, speed up their dissemination, and inspire the creation of new types of statistical products.
That is the impetus for developing the Curated Data Enterprise (CDE), an innovation in data science aimed at creating statistical products from all data types and building the infrastructure to support them. The Curated Data Enterprise, as the name implies, includes an end-to-end curation model to capture the complete statistical product development process. The CDE is designed to enable data discovery and retrieval, data quality assessment across multiple and diverse sources of information, and the reuse of data and models over time to accelerate statistical product development. The Census Bureau has partnered with the University of Virginia, a working group of former Census Bureau Directors and a Communication Director, and university, non-profit, and industry experts to develop this approach.
The Census Bureau provides the latest official statistics, facts, and figures about America’s people, places, and economy. It collects data for 130 surveys annually and the decennial census that gives the Bureau its name. The Census Bureau collects data from households, businesses, governments, and non-profit organizations. For each survey, tabulations and margins of error are published in news releases and reports. Public-use microdata subject to disclosure rules are provided for household and demographic surveys. Microdata for economic and household surveys, without disclosure rules applied, are accessible to researchers through the Federal Statistical Research Data Centers.
Statistical agencies in other countries are also modernizing their surveys and statistical product development. See a summary of selected countries (Lanman, Davis, and Shipp 2023).
To realize the CDE vision, the development of statistical products will address stakeholder questions using all data types – designed surveys and censuses, public and private administrative data, opportunity data scraped from the internet, and procedural data (Keller et al. 2022). This new approach aligns with the Census Bureau’s modernization and transformation (Thieme 2022) while maintaining the fundamental responsibilities of statistical agencies (OMB 2023). It is also consistent with a conclusion by the NASEM Panel on the Implications of Using Multiple Data Sources for Major Survey Programs: “The quality of statistics produced from multiple data sources depends on properties of the individual sources as well as the methods used to combine them. A new framework of quality standards and guidelines is needed to evaluate such data sources’ fitness for use” (NASEM 2023, 192).
The CDE approach provides such a framework to address many of the challenges that official statistics face today, as well as demonstrate that they are poised to adopt a new approach to producing official statistics. For example:
The timeliness and frequency of our official statistics are insufficient when there are shocks to the economy, such as the COVID-19 pandemic, when retrospective survey data were of limited usefulness. Federal agencies responded during the pandemic with relevance and agility by creating and launching fast-response Household Pulse Surveys that met immediate needs for data, trading some quality for timeliness (Groshen 2021). Public engagement and support for these new, relevant, and timely data products at a time of crisis were essential to their success.
The policy environment has responded to technological, social, and survey changes by encouraging efficient use of existing data, reuse, sharing, and furthering open data principles. Researchers are now creating innovative statistical products using multiple data sources to better address the United States’ needs and interests. The Commission on Evidence-Based Policymaking (Abraham et al. 2018) and the Federal Data Strategy (“Federal Data Strategy, Leveraging Data as a Strategic Asset” 2021) recommendations encourage agencies to permit access to data to undertake evaluation and research studies.
Techniques like rapid scanning, text recognition, user-friendly uploads, and new devices, sensors, and systems can now record and transcribe data in real time. Using these techniques, governments and corporations now routinely and instantaneously collect and store data on behaviors and states as varied as purchase transactions, climate and road conditions, health care plan utilization, and land use and zoning. Extensive digitization and recording, better system connectedness and interactivity, and increased human-computer interaction can result in faster data accumulation, enhancing the usability of private and public administrative data while maintaining privacy and confidentiality (Brady 2019; Jarmin 2019).
New techniques and data sources can transform statistical agencies “from the 20th-century survey-centric model to a 21st-century model that blends structured survey data with administrative and unstructured alternative digital data sources,” leading to better measures of the gig economy, retail sales, healthcare, workforce, and tools and methods to integrate multiple data sources while maintaining privacy and confidentiality (Jarmin 2019).
This is the first of four sections that describe the Curated Data Enterprise. The other three sections will:
Provide an overview of the CDE and its corresponding framework (see Section 2).
Put the CDE framework into practice through a demonstration use case on the resilience of skilled nursing facilities (see Section 3).
Describe our next steps for developing the CDE through a use case research program (see Section 4).
The US Census Bureau faces the challenge of addressing complex questions requiring novel datasets and sources to answer. Official statistical agencies and public- and private-sector organizations worldwide share this challenge. Read on to discover how the Curated Data Enterprise approach might help you address your own research challenges.
2 What is the Curated Data Enterprise?
Sallie Keller, Stephanie Shipp, Vicki Lancaster, and Joseph Salvo
University of Virginia
2.1 Introduction
Today, official statistics – tables, reports, and microdata – are produced using data from a single survey. These surveys are foundational for researchers and policymakers. However, many questions cannot be answered by surveys alone. For example, creating a picture of how prepared skilled nursing facilities (SNFs) are for climate emergencies requires wrangling all types of data about the facilities and their communities. (Note: A skilled nursing facility is a facility that meets specific federal regulatory certification requirements that enable it to provide short-term inpatient care and services to patients who require medical, nursing, or rehabilitative services.) This includes SNF data on the number and dates of inspections, deficiencies, resident mental and physical health, and the number of nursing staff and where they live; community assets data on the number of shelter facilities, health professionals, and emergency service providers; and community risks data on the probability of an extreme climate event. Can we create new statistical products useful to policymakers, emergency responders, skilled nursing facility staff, and others to inform their decisions?
Official statistics are essential for a democratic society as they provide economic, demographic, social, and environmental data about the government, the economy, and the environment. Official statistical agencies should compile and make these statistics available impartially to honor the right to public information.
Objective, reliable, and accessible official statistics instill confidence in the integrity of government and public decision-making regarding a country’s economic, social, and environmental situation at national and international levels. They should be widely available and meet the needs of various users (United Nations 2024).
With the explosion of available data, there is an opportunity to combine all types of information to create statistical products that address cross-cutting topics for a wide range of purposes and uses. The US Census Bureau is modernizing and transforming its enterprise system to accommodate a new way to produce statistical products that take advantage of all data types: designed surveys and censuses, public and private administrative data, opportunity data scraped from the Internet, and procedural data (Keller et al. 2022).
“We are moving towards a single enterprise, data-centric operation that enables us to funnel data from many sources in a single data lake using common collection and ingestion platforms … This is the essence of a curated data approach — assemble, assess, and fill in the gaps to create quality statistical data.”
Robert Santos, Director, U.S. Census Bureau
This curated approach is embodied in the Curated Data Enterprise (CDE). The Curated Data Enterprise Framework in Figure 1 provides a guide for creating statistical products that enable the full integration of data from many sources (Keller et al. 2020). At the heart of the framework are the purposes and uses that provide the context and driving force for developing the statistical product. The outer rectangle in Figure 1 identifies the guiding principles for ethical, transparent, and reproducible product development and dissemination. The inner rectangle identifies the steps in the statistical product development, including integrating primary and secondary data sources. The arrows convey that this process is not always linear. Instead, the process is iterative: new information may be discovered at any point, requiring prior steps to be reevaluated and updated. Our Social and Decision Analytics research group in the Biocomplexity Institute has developed, tested, and refined the CDE (data science) framework in our research since 2013 (Keller, Lancaster, and Shipp 2017; Keller et al. 2020). The proposed use of the CDE to develop statistical products at the Census Bureau is in its early stages.
Section 3 of this series puts the CDE framework into practice through a use case on skilled nursing facilities’ preparedness for emergencies during extreme climate events. As a prelude to Section 3, Figure 2 illustrates how the statistical product development component of the framework works in action.
The CDE Framework’s guiding principles and research steps are described below.
Guiding Principles:
- Section 2.2.1 Purposes & Uses
- Section 2.2.2 Stakeholders
- Section 2.2.3 Curation
- Section 2.2.4 Equity & Ethics
- Section 2.2.5 Privacy & Confidentiality
- Section 2.2.6 Communication & Dissemination
Research Steps:
- Section 2.3.1 Subject Matter Input
- Section 2.3.2 Data Discovery
- Section 2.3.3 Data Ingestion & Governance
- Section 2.3.4 Data Wrangling
- Section 2.3.5 Fitness-for-Purpose
- Section 2.3.6 Statistics Development
2.2 Guiding Principles
2.2.1 Purposes & Uses
The CDE is centered on developing statistical products to meet specific purposes & uses. Researchers and stakeholders propose the purposes and uses, defining the “why” for developing statistics and statistical products. They include questions or issues that the statistics should be designed to support and are clarified by documented best practices, literature reviews, and conversations with subject matter experts.
2.2.2 Stakeholders
Stakeholders include individuals, groups, and organizations that have the potential to affect or be affected by the outcome of the research. Engaging stakeholders is crucial for fostering the connection and trust that can lead to better decision making. Kujala et al. (2022) best described the principle of stakeholder engagement: “Stakeholder engagement refers to the aims, activities, and impacts of stakeholder relations in a moral, strategic, and pragmatic manner.” When placed within the CDE context and represented in the framework, collaborative engagement with stakeholders occurs at all stages of product development to better understand what the final product needs to look like. Further, product development is not a linear process but occurs through successive waves of iteration with users.
Forming partnerships with stakeholders is instrumental in identifying requirements and implementing statistical products. This requires listening to community voices in an active engagement strategy.1 Of necessity, these partnerships entail collaboration, such as creative and collaborative problem-solving workshops and the development of innovative digital tools vetted by networks of users.2
2.2.3 Curation
The broad meaning of curation is the act of organizing, documenting, and maintaining a collection of artifacts. The artifacts of the development and dissemination of statistics or statistical products include all the components in Figure 1, from meeting with stakeholders to formulating the purposes and uses to creating and disseminating the statistical products. Maintaining the artifacts is the essence of the CDE. Every step in the process should be documented and easily accessible in a repository, for example, GitHub, for the work to be transparent and reproducible. Curation in the context of the CDE is an end-to-end activity. It involves documenting the purpose and use, providing the context for acquiring, wrangling, and archiving data from many sources to support the development of statistical products. It will include metadata (Cannon 2013), the code used to read and write the data, and the code that ingested the data from the source and prepared it for analysis.
Curation Steps
- Document the development of the research questions, why this research is important, and how it supports the purposes and uses and resulting statistical product.
- Document the context for the purposes and uses, i.e., a policy directive, stakeholder request, policy evaluation, etc.
- What stakeholder engagement and transparency are built into the process?
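The curation idea described above, documenting each step so that it can be archived in a repository, can be sketched as a small machine-readable record. This is only an illustration under stated assumptions: the `CurationRecord` class, its field names, and the example file path are hypothetical and are not part of the CDE framework itself.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CurationRecord:
    """One documented step in statistical product development.

    All field names are illustrative assumptions, not a CDE specification.
    """
    step: str                     # e.g. "data discovery", "data wrangling"
    purpose_and_use: str          # the question the product is meant to support
    context: str                  # e.g. policy directive, stakeholder request
    data_sources: list = field(default_factory=list)
    code_paths: list = field(default_factory=list)  # code that ingested or prepared the data
    notes: str = ""


record = CurationRecord(
    step="data ingestion",
    purpose_and_use="How prepared are skilled nursing facilities for climate emergencies?",
    context="stakeholder request",
    data_sources=["facility inspection data"],      # hypothetical source label
    code_paths=["ingest/inspections.py"],           # hypothetical path
    notes="Source read as delivered; no records dropped at this step.",
)

# Serialize to JSON so the record can be committed alongside the code.
manifest = json.dumps(asdict(record), indent=2, sort_keys=True)
print(manifest)
```

Keeping such records as plain text next to the code is one way to make the history of decisions reviewable with ordinary version-control tools.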
2.2.4 Equity & Ethics
An ethics review ensures dialogue on this topic throughout the statistical product development and dissemination life cycle. This involves teams of researchers and stakeholders across many areas of expertise, each with its own research integrity norms and practices. This requires that ethics be woven into every aspect of the CDE. An equity review ensures that underserved groups are represented and biases inherent in various data sources are acknowledged.
Curation Questions
- What are the project’s expected benefits to the “public good”? Do they outweigh potential risks to specific sub-populations, e.g., individuals, firms, and their locations by different levels of geography?
- Are there implicit assumptions and biases regarding the studied communities in framing the project and associated data sources? If yes, how will they be addressed?
- What type of institutional approval process and contracts are needed? What statistical quality standards and confidentiality standards will be needed? For an explanation of the Institutional Review Board see Note 1.
An ethics checklist can help with this process. Links to ethics checklists are provided below.
- University of Virginia, Biocomplexity Institute, Social and Decision Analytics Division Data Science Project Ethics Tool
- United Kingdom Government, Data Ethics Framework
2.2.5 Privacy & Confidentiality
Privacy is about the individual, whereas confidentiality is about the individual’s information. Privacy refers to an individual’s desire to control their information. Confidentiality refers to the agreement between the researcher (which could be an agency like the Census Bureau) and the individual regarding how the individual’s information will be handled, managed, and disseminated (Keller, Shipp, and Schroeder 2016). Privacy and confidentiality form a guiding principle because they need to be considered and embraced at the earliest possible stages of statistical product development and will impact dissemination choices.
Curation Questions
- What steps are taken to ensure the privacy and confidentiality of the data?
- What statistical methods (if any) are used to ensure the privacy and confidentiality of the data?
- How do the methods chosen to protect confidentiality affect the purposes and uses of the data?
- What stakeholder engagement and transparency are built into the process?
- Does the context surrounding the purposes, uses, and anticipated data sources require an Institutional Review Board (IRB) review and approval? If yes, is it archived?
In the United States, Institutional Review Boards (IRBs) assess the ethics and safety of research studies involving human subjects, such as behavioral studies or clinical trials for new drugs or medical devices. Today, the definition of human subjects has evolved to include secondary data, such as administrative data collected for other purposes, e.g., local property data collected for tax purposes.
The Belmont Commission was convened in the late 1970s after the ethical failures of many research projects that involved vulnerable populations surfaced. The Belmont Commission issued three principles for the conduct of ethical research:
Respect for people—treating people as autonomous and honoring their wishes
Beneficence—understanding the risks and benefits of the study and weighing the balance between (1) doing no harm and (2) maximizing possible benefits and minimizing possible harms
Justice—deciding if the risks and benefits of research are distributed fairly.
These principles were translated to a set of regulations called the Common Rule that govern federally-funded research. The Belmont Commission provided the foundation for Institutional Review Board (IRB) principles and focused on research involving human subjects in experiments and studies. IRB approval is required to be eligible for federal grants and contracts. Many universities also require IRB review for research conducted by faculty, students, and researchers (Shipp, LaLonde, and Martinez 2023).
2.2.6 Communication & Dissemination
Communication involves sharing data, statistical method choices, well-documented code, working papers, and dissemination through research team meetings, stakeholder engagements, conference presentations, publications, webinars, websites, and social media. As a principle, communication and dissemination are critical to ensure that statistical product development processes and findings are transparent and reproducible (Berman et al. 2016). An essential facet of this step is to tell the story of the analysis by conveying the context, purpose, and implications of the research and findings (Berinato 2019; Wing 2019; NASEM 2022).
Curation Questions
- Are the meeting notes, statistical products, code, reports, and presentations archived in a repository?
- Briefly describe what did not work in this process, e.g., data wrangling challenges where data sources could not be integrated, data source changes after a fitness-for-purpose assessment, analyses that were changed because assumptions were not met, etc.
- Have project methods and outputs been made as transparent as possible?
- Are the potential limitations of the research clearly presented?
- Why should, or should not, the research be used as the basis for an institutional or policy action?
- Have the predicted benefits and social costs to all potentially affected communities been considered?
2.3 Research Steps
2.3.1 Subject Matter Input
Subject matter (domain) expertise plays a role in translating the information acquired into understanding the underlying phenomena in the data (Box et al. 1978). Domain knowledge provides the context to define, evaluate, and interpret the findings at each research stage (Leonelli 2019; Snee, DeVeaux, and Hoerl 2014). Subject matter input can be obtained through a review of the literature, talking to experts, or learning about their work at conferences or other convenings. Subject matter experts are different than stakeholders. Both provide important input to identifying and clarifying purposes and uses.
Curation Steps
- Document the meetings with subject matter experts and stakeholders.
- Document the literature search methods and the results of the literature review.
- Document the choices made during the development of the products.
- Were subject matter experts and stakeholders recruited from underrepresented groups?
2.3.2 Data Discovery
Data discovery identifies potential sources that address the research goals defined by purposes and uses. Data sources include the following types (Keller et al. 2020):
Designed data are collected using statistically designed methods, such as surveys, censuses, and data generated from an experimental or quasi-experimental design, such as a clinical trial or agricultural field study.
Administrative data are collected for the administration of an organization or program by entities such as government agencies.
Opportunity data are derived from internet-based information, such as websites, wearable and other sensor devices, and social media, and captured through application programming interfaces (APIs) and web scraping, e.g., geocoded place-based data, transportation routes, and other data sources.
Procedural data are processes and policies, such as a change in health care coverage, a data repository policy outlining procedures and the metadata required to store data, or a responsible AI policy.
The goal of the data discovery process is to think broadly and imaginatively about all data types and to capture the variety of data sources that could be useful for the problem. There are three steps in the data discovery process (Keller, Shipp, and Schroeder 2016):
Identify potential data sources and make an inventory.
Create a set of questions to screen the data sources to ensure the data meet the criteria for use.
Select and acquire the data sources that meet the screening criteria.
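The three data discovery steps above can be sketched in code as an inventory, a set of screening questions, and a selection. This is a minimal sketch, not the authors' procedure: the `DataSource` fields, the inventory entries, and the two screening questions are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    """One entry in the data-source inventory (all fields are illustrative)."""
    name: str
    data_type: str         # "designed", "administrative", "opportunity", or "procedural"
    geography: str         # finest geography available, e.g. "facility" or "state"
    has_documentation: bool


# Step 1: inventory of candidate sources (hypothetical entries).
inventory = [
    DataSource("facility inspections", "administrative", "facility", True),
    DataSource("scraped facility websites", "opportunity", "facility", False),
    DataSource("state health survey", "designed", "state", True),
]

# Step 2: screening questions, each a named yes/no check on a source.
screening_questions = {
    "is documented": lambda s: s.has_documentation,
    "is at facility level": lambda s: s.geography == "facility",
}


def screen(source, questions):
    """Return the names of the screening questions the source fails."""
    return [name for name, check in questions.items() if not check(source)]


# Step 3: select the sources that pass every screening question.
selected = [s.name for s in inventory if not screen(s, screening_questions)]
print(selected)
```

Recording which questions a rejected source failed, rather than only a pass or fail flag, keeps the reasoning behind the selection available for later curation.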
Curation Steps
- Describe your data discovery process and reasoning behind the selected data sources.
- Do underrepresented groups have adequate geographic coverage? If not, are there methods, such as synthetic data, you can use to provide adequate coverage?
- Have checks and balances been established to identify and address implicit biases in the data and interpretation of the data? Has the team engaged in discussion and provided insights across their diverse perspectives?
- Describe the assumptions that need to be made to use these data sources.
- Identify and document the paradata and metadata that describe each data source. Paradata describe how the data were collected, while metadata are “data about data.” Metadata include information about the data’s content, data dictionaries, and technical documents that will help the user assess the data’s fitness for purpose (Cannon 2013; NASEM 2022).
- Discuss data sources you would have used if they were available.
2.3.3 Data Ingestion & Governance
Data ingestion is the process of bringing data into the data management platform(s) for use. Data governance establishes and adheres to rules and procedures regarding data access, dissemination, and destruction.
Curation Steps
- Document policies and institutional agreements for data use.
- Have team members reviewed data use agreements, standard operating procedures (SOPs), and data management plans? Are they fair?
- Do additional procedures need to be defined for this project?
- Document the code and processes used to ingest the data sources and manage governance.
2.3.4 Data Wrangling
Data wrangling includes the activities of data profiling, preparing, linking, and exploring used to assess the data’s quality and representativeness and what analyses the data can support.
Curation Steps
- Describe any data quality issues within the stated purpose and use context and how they were resolved. This can include statistical solutions like imputing missing data, identifying outliers, or constructing synthetic populations.
- How representative are the data?
- What populations are and are not covered?
- Describe any issues with the wrangling process and how they were resolved.
- Document the code used to wrangle the data and describe how it was validated.
- Document assumptions made regarding the transformation and use of the data.
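The four wrangling activities named in this section (profiling, preparing, linking, and exploring) can be sketched on two tiny record sets. Everything here is an assumption made for illustration: the facility identifiers, column names, and values are invented, and dropping incomplete records is only one of several possible preparation choices.

```python
# Two hypothetical sources keyed by a made-up facility identifier.
facilities = [
    {"facility_id": "A1", "beds": 120},
    {"facility_id": "B2", "beds": None},        # missing value
    {"facility_id": "C3", "beds": 60},
]
inspections = [
    {"facility_id": "A1", "deficiencies": 4},
    {"facility_id": "C3", "deficiencies": 1},
    {"facility_id": "D4", "deficiencies": 7},   # no matching facility record
]


def profile(rows, column):
    """Profiling: share of records with a missing value in `column`."""
    missing = sum(1 for r in rows if r.get(column) is None)
    return missing / len(rows)


def prepare(rows, column):
    """Preparing: drop records with a missing value in `column`."""
    return [r for r in rows if r.get(column) is not None]


def link(left, right, key):
    """Linking: inner join on `key`, also reporting unmatched keys on each side."""
    right_by_key = {r[key]: r for r in right}
    linked = [{**row, **right_by_key[row[key]]} for row in left if row[key] in right_by_key]
    left_keys = {row[key] for row in left}
    right_keys = set(right_by_key)
    return linked, sorted(left_keys - right_keys), sorted(right_keys - left_keys)


missing_share = profile(facilities, "beds")
clean = prepare(facilities, "beds")
linked, only_left, only_right = link(clean, inspections, "facility_id")

# Exploring: deficiencies per 100 beds on the linked records.
rates = {r["facility_id"]: r["deficiencies"] / r["beds"] * 100 for r in linked}
print(missing_share, only_left, only_right, rates)
```

Reporting the unmatched keys alongside the linked records gives a direct view of which facilities are not covered, which feeds the representativeness questions in the curation steps.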
2.3.5 Fitness-for-Purpose
Fitness-for-purpose starts with assessing the constraints imposed on the data by the particular statistical methods used and the population to which the inferences extend. It is a function of the modeling, data quality needs of the models, and data coverage (representativeness) needs of the models. The statistical product’s “fitness-for-purpose” involves those on the receiving end of the data helping identify issues germane to the data application, such as identifying biases affecting equity. For example, given known differences in their availability, does using administrative records lead to better modeling outcomes for some groups more than others? What can be done to compensate for such bias?
Curation Steps
- Document the constraints and limitations of the data.
- What are the limitations of the results? Are the results useful, given the purpose of the study?
- Discuss the populations to which any inferences will generalize.
- Do the statistical results support the potential benefits of the study previously stated?
- Do any data require revisiting the question of potential biases being introduced through the choice of data sets and variables?
2.3.6 Statistics Development
The development of statistics and statistical products for dissemination is a function of the research questions, the data’s limitations, and the assumptions of the statistical method(s) used.
Curation Steps
- Describe the statistical methods planned and used and how the method assumptions were evaluated.
- Discuss the conclusions of the statistical analyses and any inferences that can be made from the disseminated statistical products.
- Discuss how the statistics support the purposes and uses driving the development of the products.
Here, we have defined the CDE and provided a conceptual walk through of the framework from Figure 1. In the next part of this series, the CDE framework is put into practice through a demonstration use case on the resilience of skilled nursing facilities.
3 Translating the Curated Data Model into Practice through a Demonstration Use Case: Climate Resiliency of Skilled Nursing Facilities
Vicki Lancaster, Stephanie Shipp, Sallie Keller, Henning Mortveit, Samarth Swarup, Aaron Schroeder, and Dawen Xie
University of Virginia, Biocomplexity Institute
3.1 Introduction
Here, we demonstrate how the CDE Framework described in Section 2 can be implemented for a research use case related to skilled nursing facilities. The framework provides the guiding principles for ethical, transparent, and reproducible research and dissemination and the research process for developing the statistical product.
Across the United States, federally regulated skilled nursing facilities (SNFs) provide essential care, rehabilitation, and related health services to about 1.3 million people. An SNF is a facility that meets specific federal regulatory certification requirements that enable it to provide short-term inpatient care and services to patients who require medical, nursing, or rehabilitative services. Their patients can be among the most vulnerable members of our society, and yet, historically, SNFs have not been incorporated into existing emergency response systems. For example, during the 2004 Florida hurricane season, SNFs were given the same priority as day spas for restoring electricity, telephones, water, and other essential services (Hyer et al. 2006). Even worse are the deaths of SNF residents in Louisiana following Hurricanes Katrina and Rita in 2005 (Dosa et al. 2008). The problem has persisted. In Louisiana, 15 SNF residents died when evacuated to a warehouse during Hurricane Ida (2021), and 12 died in Florida as a result of Hurricane Irma (2017). In both instances, the deaths were attributed to extreme heat and lack of electricity (Skarha et al. 2021).
These events prompted the White House initiative Protecting Seniors by Improving Safety and Quality of Care in the Nation’s Nursing Homes (The White House 2022), which states, “All people deserve to be treated with dignity and respect and to have access to quality medical care.”
However, there are questions that need to be addressed to best protect SNFs and their residents. For example, how resilient are SNFs in extreme climate events? This use case demonstration shows how we built a new statistical product to address this question using the CDE Framework (V. Lancaster, Shipp, et al. 2023).
3.2 Purposes & Uses
A skilled nursing facility (SNF) is a federally regulated nursing facility with the staff and equipment to provide skilled nursing care, skilled rehabilitation services, and other related health services (Medicare & Medicaid Services 2023). The context of this use case is to create a baseline picture of SNFs in Virginia and then integrate information on the risk of extreme flood events to assess facility and community preparedness – for example, how likely are the nursing staff3 to make it to the facility in the event of a flood?
This use case has two parts. The first creates a baseline data picture of SNFs, bringing together data about the residents, nursing staff, and SNF characteristics. The second addresses two issues raised in the White House initiative (The White House 2022): emergency preparedness and nurse staffing. We frame these issues as three purpose and use questions, with the ultimate goal of creating statistical products that address them:
Can SNF workers get to work during an extreme flood event?
Are SNFs prepared for a flood emergency?
Can communities support SNFs during an emergency?
3.3 Statistical Product Development Stages
Subject Matter Input and Literature Review
The subject matter experts consulted included nursing facility administrators, SNF resident advocates, demographers, and researchers. Our discussions and literature review informed us of the many federal policies governing SNFs regarding inspections and data reporting requirements (procedural data). In addition, we were told about nonpublic data sources on residents and SNF staff that were aggregated to the SNF level and provided to the public under a grant from the National Institute on Aging. This information was important since we had yet to come across this source in our data discovery process. The dialogue with experts and our literature review helped us generate a “wish list” of variables we used to inform our data discovery process that we visualized into a conceptual data map (see Figure 3).
Data Discovery
Data discovery focused on identifying data sources to address the purpose and use questions and was informed by the conceptual data map.
For the first question – Can SNF workers get to work during an extreme flood event? – we discovered and used proprietary synthetic population, transportation route, and building data sources, along with publicly available flood data. The HERE Premium Streets proprietary data include information about roads, such as road type, speed limits, and number of lanes. The proprietary synthetic population data, Building Knowledge Base (BKB), are used to identify where SNF workers live and work to map transportation routes from home to work (Mortveit, Xie, and Marathe 2023). Publicly available data from the Federal Emergency Management Agency (FEMA) provided flooding risk estimates along the routes from nursing staff homes to the SNF.
For the second question – Are SNFs prepared for a flood emergency? – we used Centers for Medicare & Medicaid Services (CMS) SNF inspection and deficiency data as a proxy for preparedness. We also examined SNF residents’ physical and mental health to assess SNF emergency preparedness. For example, if most residents faced mobility challenges, the SNF would need more resources available during an emergency to move residents to a safer facility. We used data about residents from the Long Term Care Focus (LTCFocus 2022) Public Use Data sponsored by the National Institute on Aging (Brown University 2022).
We used data to measure community resilience, assets, and risks by geography at the county, city, and census tract levels to address the third question, Can communities support SNFs during an emergency? These data included:
- Health professional shortage areas (HRSA 2022);
- Shelter facilities and emergency service providers data (Homeland Security: Geospatial Management Office 2022); and
- Community Resilience Indicator Analysis and National Risk Index for Natural Hazards (FEMA 2022).
All data are provided in a GitHub repository along with their metadata, except for the three proprietary data sources. Articles about how the synthetic estimates are constructed are provided for two of these proprietary data sources. The third data source was obtained from a private-sector vendor whose data and documentation are proprietary; a link is provided to their website.
Data Ingest & Governance
All the public data, metadata, code, statistical products, data processes, and relevant literature on SNF policies and regulations are stored in a GitHub repository.
In our experience, data wrangling is the most time-consuming and challenging part of product development. This speaks directly to the benefit of the CDE: once a researcher has wrangled multiple data sources together, the integrated result can be made available to other researchers.
The two predominant issues with data wrangling for this Use Case included reconciling data sources that contain data on the same topic and creating linkages between data sources. For example, we reviewed three hospital data sources:
- Homeland Security Infrastructure Foundation-Level Data (HIFLD) (DHS 2022)
- HealthData.gov - COVID-19 Reported Patient Impact and Hospital Capacity by State (HHS 2022)
- Map of VHHA Hospital and Health System Members (Virginia Hospital & Healthcare Association 2022)
Inconsistencies and omissions observed across the three data sources included:
- non-standard hospital names and hospital classification types;
- inconsistent availability of hospital IDs (such as Medicare Provider Number);
- conflicting geographic information, including address, latitude, and longitude.
We did not attempt to reconcile these inconsistencies for the demonstration but decided to use a single source for shelter facility and emergency service provider data. We used HIFLD data since they provided the most current data (DHS 2022). The use of these data reinforces the purpose of the use case – to illuminate the challenges in creating statistical products and what the Census Bureau would need to consider.
Similar inconsistencies made it difficult to link data sources using geographic variables. For example, we used shelter facility and emergency service provider data sources from the HIFLD – including hospitals, Red Cross Chapter Facilities, National Shelter System Facilities, emergency medical service stations, fire stations, and urgent care facilities – to calculate a metric for potential community support. The goal was to place each facility in a Virginia county or independent city. Virginia is divided into 95 counties and 38 independent cities, which are considered county-equivalents for census purposes. In some cases, a county and a city share the same name (e.g., Richmond County and Richmond City, each in a different location in Virginia). It was necessary to canonicalize the county and city names (when available), which meant aligning upper and lower cases, removing unnecessary characters, and distinguishing between county and city.4
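The canonicalization step can be illustrated with a minimal sketch. The function name `canonicalize_locality` and the specific rules (lowercasing, dropping non-letter characters, and recognizing "City of X" or "X County" patterns) are assumptions made for illustration; they are not taken from the project's repository.

```python
import re

def canonicalize_locality(raw):
    """Return (name, kind), where kind is 'city', 'county', or None.

    Hypothetical helper: the rules below are illustrative assumptions.
    """
    # Align case and replace anything that is not a letter or space.
    text = re.sub(r"[^a-z ]", " ", raw.lower())
    text = re.sub(r"\s+", " ", text).strip()
    # "city of richmond" -> ("richmond", "city")
    match = re.match(r"^(city|county) of (.+)$", text)
    if match:
        return match.group(2), match.group(1)
    # "richmond county" -> ("richmond", "county")
    for kind in ("city", "county"):
        if text.endswith(" " + kind):
            return text[: -len(kind) - 1], kind
    # No label present: leave the kind undetermined rather than guess.
    return text, None
```

Returning the kind separately keeps Richmond County and Richmond City distinct, and names such as "James City County" resolve to a county because only the trailing label is removed. Names with no label are returned with `None` so they can be reviewed by hand.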
The challenge with locating shelter facilities and emergency service providers within a county or independent city was that the data sources used different variables to identify location (latitude and longitude, address, ZIP code5, Federal Information Processing Standards (FIPS) code, and county/city name). In cases where the data source only had a ZIP or FIPS code, a Department of Housing and Urban Development crosswalk was used to link the two codes; in other cases, a crosswalk that linked non-independent cities and towns to counties was used; and in others, a crosswalk that linked FIPS codes to counties and independent cities. Researchers would benefit from exhaustive crosswalks between all variables on the same topic, such as location variables, facility names, and identification numbers, to reduce the time spent on data wrangling.
Popular indices, such as those for climate disaster risk and community resilience, are operationalized differently across the departments and agencies of the federal and state governments and across the private and non-profit sectors. It is an enormous task to review the methodology and technical reports (if available) to understand their differences and decide which versions are most relevant (fitness-for-purpose) for a particular use case. After reviewing the options for this use case, we determined that the National Risk Index for riverine and coastal floods from FEMA was the best option for climate risk estimates. The detailed technical report, National Risk Index Technical Document (FEMA 2021), provides a clear assessment of the assumptions and limitations of the data and a description of how the risk estimates were derived. Researchers would benefit from guidance on the numerous constructions of indices on the same topic. A use case on a specific index topic could be used to highlight differences and similarities among indices, which would help with data wrangling and fitness-for-use. Ideally, the use case could benchmark the various constructions and provide a statistical assessment.
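A crosswalk lookup of this kind might look like the sketch below. The record layout, the function name `locate_facility`, and the rule of assigning a ZIP code to the county holding its largest share are assumptions for illustration. The ZIP codes and shares in the example tables are made up, and a real crosswalk file would need to be loaded in their place.

```python
# Illustrative rows only: the ZIP codes and shares are invented.
ZIP_TO_FIPS = {
    "00001": [("51760", 0.9), ("51087", 0.1)],  # ZIP split across two areas
    "00002": [("51159", 1.0)],
}
FIPS_TO_NAME = {
    "51760": "Richmond city",
    "51159": "Richmond County",
    "51087": "Henrico County",
}

def locate_facility(record, zip_to_fips, fips_to_name):
    """Return (fips, locality name) for a facility record.

    Prefers an explicit FIPS code; otherwise falls back to the ZIP
    crosswalk and picks the county with the largest share (an assumption).
    """
    fips = record.get("fips")
    if fips is None and record.get("zip") is not None:
        candidates = zip_to_fips.get(record["zip"], [])
        if candidates:
            fips = max(candidates, key=lambda pair: pair[1])[0]
    return fips, fips_to_name.get(fips)
```

A record that cannot be matched returns `(None, None)`, which makes the unresolved facilities easy to count and inspect.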
3.3.1 Question 1: Can SNF workers get to work during an extreme flooding event?
Having sufficient nursing staff is a significant concern for ensuring resident safety and quality of care.
Since proprietary synthetic population data and commercial sector digitized mapping data were used to construct the routes SNF nursing staff are likely to take from home to work, only an outline of the computational process used to identify the routes is provided. Publicly available data from FEMA were used to estimate flooding risk along a particular route. Below is a general description of the modeling steps and the proprietary data used to assess SNF vulnerability as a function of the nursing staff’s inability to report to work due to the transportation infrastructure (Choupani and Mamdoohi 2016).
Computational modules
The process, which relies on proprietary data, starts with network construction and ends with routes. Its basic outline is given here. For more details, see the GitHub repository: Vulnerability of SNFs concerning Commuting.
- Extract network data from HERE (2021 Q1 in this use case).
- Process the extracted data to form a network suitable for routing. This includes inference of speed limits for road links where such data is missing.
- Prepare origin-destination pairs. In this case, each pair consists of a worker’s home and work locations. The person is constructed in the synthetic population pipeline, and residences and workplaces are derived through the data fusion process used to construct the NSSAC building database.
- Construct routes using the Quest router.
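The outline above can be illustrated with a small stand-in. The sketch below is not the router used in the project and does not read HERE data; it fills in missing speed limits from a hypothetical table of defaults by road type and then finds the fastest route with Dijkstra's algorithm. The names `DEFAULT_SPEED_KPH`, `build_graph`, and `fastest_route`, the link tuple layout, and the default speeds are all assumptions.

```python
import heapq

# Hypothetical defaults used when a link has no recorded speed limit.
DEFAULT_SPEED_KPH = {"highway": 100.0, "arterial": 60.0, "local": 40.0}

def build_graph(links):
    """links: iterable of (node_a, node_b, length_km, road_type, speed_kph or None)."""
    graph = {}
    for a, b, km, road_type, speed in links:
        if speed is None:
            speed = DEFAULT_SPEED_KPH[road_type]  # infer the missing speed
        minutes = 60.0 * km / speed
        # Treat every link as two-way for simplicity.
        graph.setdefault(a, []).append((b, minutes))
        graph.setdefault(b, []).append((a, minutes))
    return graph

def fastest_route(graph, origin, destination):
    """Dijkstra's algorithm on travel time; returns (path, minutes)."""
    best = {origin: 0.0}
    previous = {}
    queue = [(0.0, origin)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == destination:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, minutes in graph.get(node, []):
            new_cost = cost + minutes
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    if destination not in best:
        return None, float("inf")
    path = [destination]
    while path[-1] != origin:
        path.append(previous[path[-1]])
    return path[::-1], best[destination]
```

Each origin-destination pair (a worker's home and the SNF) would be passed to `fastest_route`, and the census tracts crossed by the returned path would then be looked up in the flood risk data.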
Once the routes to an SNF were established, the expected number of nursing staff at an SNF during a flood event could be calculated as the sum of the probabilities of each worker being able to commute to work during a flood event. A computational model was developed using the following data:
- SNF locations in Virginia from the Centers for Medicare & Medicaid Services (CMS);
- Home locations of workers at each SNF assigned from the synthetic population and Building Knowledge Base (Beckman, Baggerly, and McKay 1996; Mortveit, Xie, and Marathe 2023);
- Virginia road networks; and
- FEMA census tract-level riverine and coastal flood risks.
Router software used the Virginia road network from the HERE map data to compute each nursing staff member’s likely route to their SNF. Routers are commonly used within transportation and traffic simulators. The router software used for this demonstration is a highly parallelizable router previously developed in BI NSSAC, known as the Simba router (Barrett et al. 2013).
The FEMA risk data provide the riverine and coastal flood risks for each census tract in Virginia. Given the routes, these risks were used to estimate the probability of the nursing staff making it to work. The National Risk Index Technical Document (FEMA 2021) describes how natural hazard risks are calculated. We divide these risk estimates, which range from 0 to 100, by 100 and use the result as a proxy for the probability that a worker cannot reach the SNF. A risk of zero means there is zero probability of being unable to reach the SNF due to an extreme flood event. In contrast, a risk of 100 indicates the roads are underwater, and the probability of being unable to reach the SNF is one.
The maximum risks along transportation routes leading to an SNF range from 0 to 47 for riverine flooding and 0 to 40 for coastal flooding. We assume the combined value of the maximum riverine and coastal flood risks along a worker’s transportation routes, divided by 100, is the worker’s probability of not getting to work during a flooding event.
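The arithmetic can be sketched as follows. The text does not say how the riverine and coastal maxima are combined, so the sketch assumes they are added and capped at 100; that choice, and the function names, are illustrative assumptions rather than the documented method.

```python
def prob_cannot_reach(max_riverine, max_coastal):
    """Probability a worker cannot reach the SNF, from risks on a 0-100 scale."""
    # Assumption: "combined" means the sum of the two maxima, capped at 100.
    combined = min(max_riverine + max_coastal, 100.0)
    return combined / 100.0

def expected_staff(route_risks):
    """Expected number of staff arriving: the sum of each worker's
    probability of reaching the SNF.

    route_risks: one (max_riverine, max_coastal) pair per worker's route.
    """
    return sum(1.0 - prob_cannot_reach(r, c) for r, c in route_risks)
```

For instance, a route with maximum risks of 10 (riverine) and 15 (coastal) would give a probability of 0.25 of not reaching the facility under this assumption, so that worker contributes 0.75 to the expected staff count.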
Since we do not have data on the exact home locations of the nursing staff, we estimated how many could reach the facility by taking a random sample (whose size is the CMS average daily nursing staff6 for an SNF) from the possible routes identified using the HERE Virginia road network. We calculated the average with a 95% nonparametric confidence interval. The 283 SNFs used in our research have a combined average daily nursing staff of 12,609. Using the above approach, we estimated that 10,005 (95% CI: 9,013, 10,700), or 79%, can get to work during an extreme flood event. The individual SNF nursing staff percentage who can make it to work ranges from 48% to 93%.
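One way such a simulation could be set up is sketched below. The details are assumptions: routes are drawn with replacement, the draw is repeated 2,000 times, and the interval is taken from the 2.5th and 97.5th percentiles of the simulated totals. The exact resampling scheme behind the reported intervals may differ.

```python
import random
import statistics

def simulate_expected_staff(route_probs, staff_size, draws=2000, seed=0):
    """Estimate the number of staff reaching an SNF with a percentile interval.

    route_probs: probability of reaching the SNF along each candidate route.
    staff_size:  CMS average daily nursing staff for the facility.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    totals = sorted(
        # Assumption: sample routes with replacement, one per staff member.
        sum(rng.choices(route_probs, k=staff_size))
        for _ in range(draws)
    )
    lower = totals[int(0.025 * (draws - 1))]
    upper = totals[int(0.975 * (draws - 1))]
    return statistics.mean(totals), lower, upper
```

Summing the facility-level estimates across all SNFs would then give a statewide figure comparable in form to the one reported above.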
Figure 4 visualizes this analysis for the 283 SNFs ordered by the observed average daily nursing staff numbers at the facility from smallest to largest, displayed using the orange line. The black line indicates the expected number in an extreme flood event and the 95% nonparametric confidence interval (grey band). The code for Figure 4 is provided in the GitHub repository.
For example, in King George County, the SNF is Heritage Hall King George (Federal Provider Number 495300 in Figure 5), located near the Potomac River, which opens to the Chesapeake Bay. According to CMS, the Heritage Hall King George facility has an average daily skilled nursing staff of 41. Using the HERE Virginia road network, we identified 101 routes the staff could use to reach the facility. The combined maximum coastal and riverine flood risks along these routes ranged from 5.6 to 66.7; a random sample of 41 from the 101 routes gives an average probability of reaching the facility of 0.74 with a 95% nonparametric confidence interval of [0.65, 0.80]. These were used to estimate the average number of nursing staff at the facility, 30, during a flood event, along with a 95% nonparametric confidence interval [14, 38]. Publicly available data from the Federal Emergency Management Agency (FEMA) provided flooding risk estimates along the routes from nursing staff homes to the SNF, along with proprietary road and building information.
3.3.2 Question 2: Are SNFs prepared for emergencies?
To address this question, we examined how prepared SNFs are for emergencies using annual inspection and deficiency data as a proxy for preparedness. CMS issues deficiencies to SNFs that fail to meet federal Medicare and Medicaid preparedness standards. Every deficiency is classified into one of 12 categories based on the scope and severity of the deficiency. There are two broad types of non-health-related deficiencies:
Emergency Preparedness Deficiencies – There are four elements of emergency preparedness. They cover an emergency plan, policies and procedures, a communication plan, and training and testing.
Fire Life Safety Code – The set of fire protection requirements is designed to provide a reasonable degree of safety from fire. They cover construction, protection, and operational features designed to provide safety from fire, smoke, and panic.
We calculated separate Emergency Preparedness and Fire Life Safety Code deficiency indices and then combined them into a single index to measure SNF preparedness and distinguish between high- and low-performing SNFs. The computation of the indices has four steps.
Number of Deficiencies: For each SNF, the total number of deficiencies during the past four years, 2018-2022, was divided by the number of SNF inspections over the same period to estimate the average number of deficiencies per inspection.
Time to Resolve Deficiencies: We next computed the average number of days it took to resolve each deficiency.
Scope and Severity of Deficiencies: We then transformed the deficiency letter inspection rating for scope and severity to a numerical weight using the CMS technical guide, Care Compare Nursing Home Five-Star Quality Rating System (Medicare & Medicaid Services 2022), and averaged the ratings.
The estimates from these three steps were summed to compute separate Emergency Preparedness and Fire Life Safety Code deficiency indices (see Figure 6) and are provided for reuse in a .csv file on GitHub.
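The four steps can be sketched for a single SNF and a single deficiency type. The weights in `PLACEHOLDER_WEIGHT` are stand-ins (letters A through L mapped to 1 through 12) and are not the values in the CMS technical guide, which would need to be substituted. The record layout and function name are also assumptions, and the sketch sums the three components as stated in the text without any rescaling.

```python
from statistics import mean

# Placeholder weights only; NOT the CMS scope-and-severity values.
PLACEHOLDER_WEIGHT = {
    letter: rank for rank, letter in enumerate("ABCDEFGHIJKL", start=1)
}

def deficiency_index(deficiencies, n_inspections):
    """Sum of the three components for one SNF and one deficiency type.

    deficiencies: list of dicts with 'rating' (A-L) and 'days_to_resolve'.
    n_inspections: number of inspections over the same period.
    """
    if not deficiencies:
        return 0.0  # no deficiencies recorded
    # Step 1: average number of deficiencies per inspection.
    per_inspection = len(deficiencies) / n_inspections
    # Step 2: average number of days to resolve a deficiency.
    avg_days = mean(d["days_to_resolve"] for d in deficiencies)
    # Step 3: average scope-and-severity weight.
    avg_weight = mean(PLACEHOLDER_WEIGHT[d["rating"]] for d in deficiencies)
    # Step 4: sum the three estimates.
    return per_inspection + avg_days + avg_weight
```

Running the function once on Emergency Preparedness deficiencies and once on Fire Life Safety Code deficiencies would give the two separate indices for a facility.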
Figure 6 displays the results of an exploratory data analysis for each index. These analyses assessed fitness-for-use; we wanted to construct an indicator with sufficient variability to discriminate between high- and low-performing SNFs. Figure 6 shows that we accomplished this: there are SNFs with indices outside the main body of the data. We summed the Emergency Preparedness and Fire Life Safety Code indices and categorized them into high, medium, low, and no deficiencies.
3.3.3 Question 3: Can communities support SNFs during emergencies?
To answer this question, we computed a community resiliency index using the US Census American Community Survey and the guidance provided by the Homeland Security document Community Resilience Indicator Analysis: County-Level Analysis of Commonly Used Indicators from Peer-Reviewed Research (Edgemon et al. 2018). The index was constructed by summing the county (census tract) level percentages for the following variables:
- fraction employed,
- fraction with no disability,
- fraction with a high school diploma or greater,
- fraction of households with at least one vehicle, and
- reverse Gini index – so all indicators are in a positive direction.
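A minimal sketch of the summation for one geographic unit follows. It assumes every input is expressed on a 0 to 1 scale and that the reverse of the income inequality index is computed as one minus its value; the field names and the function name are hypothetical.

```python
def resilience_index(unit):
    """Sum of five positively oriented indicators for one county or tract.

    unit: dict of fractions on a 0-1 scale, plus the inequality index
    under the hypothetical key 'gini'.
    """
    return (
        unit["employed"]
        + unit["no_disability"]
        + unit["hs_or_more"]
        + unit["has_vehicle"]
        + (1.0 - unit["gini"])  # assumption: reverse index = 1 - index
    )
```

Under these assumptions the index ranges from 0 to 5, with higher values indicating a more resilient community.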
Figure 7 displays the combined deficiency indices, Emergency Preparedness + Fire Life Safety Code, for each SNF with the choropleth map for the community resilience index at the census tract level. We also examined the number of shelter facilities and emergency service providers and the availability of medical staff per 10,000 residents. We constructed isochrones to establish the distance from the SNF to these potential sources of support. Working on this component of the use case highlighted the need for cross-agency data, pointing to the utility of future strategic partnering between the Census Bureau, CMS, and FEMA.
In addition to describing the population using a resilience index, we also developed a measure to present the number of shelter facilities and emergency service providers (data from Homeland Security / Homeland Infrastructure Foundation Level Data) and the availability of medical doctors (MDs) and doctors of osteopathic medicine (DOs) who provide direct patient care (HRSA 2022) (Figure 8).
The availability of MDs and DOs is described in terms of primary care health professional shortage areas. HRSA defines these as contiguous areas where primary medical care professionals are overutilized, excessively distant, or inaccessible to the population of the area under consideration. Figure 8 (bottom) shows that approximately one-third of the counties and independent cities have health professional shortage areas across their entire boundary, and another 40 percent have shortages within parts of their boundaries.
3.4 Guiding principles for ethical, transparent, reproducible statistical product development and dissemination
Communication
We communicated results throughout the Demonstration Use Case research with our Census CDE Working Group (composed of former Census Bureau Directors and Communication Director, and academic and industry census experts) and with the Census Bureau, presented at conferences such as the annual Federal Committee on Statistical Methodology conference, and shared drafts to seek input and ideas. The discussions and presentations helped to shape ideas and advance our thinking about how best to address the purpose and use questions.
Stakeholder engagement
We engaged stakeholders by sharing our research and results through conference presentations at the American Community Survey Data Users Conference and the Applied Public Data Users Conference. We also shared this demonstration project at Listening Sessions with stakeholders as an example of statistical product development. The Listening Sessions bring together 7 to 12 stakeholders by topic (e.g., children’s health) or function (e.g., state demographers) to seek their ideas for new statistical products.
Equity and ethics
As described in the Introduction, there are ethics and equity issues that drew us to develop this Use Case. Here we focus on equity and ethics vis-a-vis the data choices and analyses. With regard to ethical considerations with our data discovery process, fitness-for-purpose evaluation, and analyses, two questions arose:
What role does synthetic data have to play, and how do you benchmark it to evaluate fitness-for-purpose?
How do you construct and evaluate an index with the goal of identifying vulnerable populations?
Realizing the importance of nursing staff levels, we discussed and questioned whether the synthetic data had biases and were not representative of SNF residents and employees. We benchmarked the synthetic SNF nursing staff numbers against those submitted quarterly to CMS and observed they were biased low, so we decided to use the CMS data. These data were used to estimate the average number of nursing staff that could reach the facility during an extreme flood event (Figure 4).
In this Use Case, we were fortunate to have the “truth” to benchmark the synthetic data for the average daily nursing staff at each SNF. This was not the case for the home locations of the nursing staff. Because we had no way to benchmark the synthetic locations, we did not use them. Ideally, we would use the actual addresses of SNF employees. Instead, we used a simulation to estimate the average risks over routes leading to the SNF. This approach could be replaced with (or benchmarked against) the Census commuting data sets (e.g., Commuting Flows or the LEHD Origin-Destination Employment Statistics), with the home census tract used as the starting point for each worker. For the number of nursing staff and their home locations, it is impossible to identify potential biases that would result in the inequitable allocation of emergency rescue resources without a thorough understanding of how the synthetic data were generated.
How one evaluates the equity of an index is a more challenging task. Questions that need to be addressed include:
How do you select the variables used to construct an indicator to guide an equitable allocation of technical assistance?
What relationship between these variables is important?
What are the differences across the numerous publicly available resilience estimators? Do some lead to a more equitable allocation of technical assistance in the event of an extreme climate event?
How do you validate a resilience estimator?
The technical document Community Resilience Indicator Analysis: County-Level Analysis of Commonly Used Indicators from Peer-Reviewed Research (Edgemon et al. 2018) identified the 20 most commonly selected variables for constructing resilience estimators from peer-reviewed research. Future research will need to validate these indices against past extreme climate events.
Privacy and confidentiality
We did not do a full disclosure review. However, some data are proprietary, and we could not release those data. We discuss how we used these data.
Dissemination
We disseminated the final version of the use case in the University of Virginia Libra Open repository (V. Lancaster, Shipp, et al. 2023).
Curation
Curation involves documenting all steps of the process so that they can be repeated, validated, reused, or extended. The final report explains the process in words. Curation must also provide the data, metadata, source code, and products. This led us to construct a GitHub repository. A README file guides the reader through the material and provides instructions for replicating the research results. Note that the README file must be downloaded for the hyperlinks to work.
3.5 Using the SNF statistical product
This potential statistical product has many uses. Federal policymakers and administrators regulate SNFs; however, they only sometimes realize the impacts on costs and the need for increased resources to meet these regulations. For example, by reviewing the aggregate inspection deficiency metrics, policymakers can target resources where they are most needed. Providing additional funding to pay workers more, improve their facilities, and address inspection deficiencies would improve the quality of SNFs.
The media and advocacy groups play a role in highlighting good and bad cases of SNF care or where communities do not have adequate assets to support SNFs during an emergency event. For example, a New Yorker article (Rafiei 2022) highlighted how nursing homes decline dramatically when bought by private equity owners. The GAO (September 22, 2023) recently identified the need for more information about private equity ownership in CMS data – a gap that CMS needs to address. And, of course, researchers and analysts are essential for conducting research that leads to creating and improving statistical products around SNFs. By releasing a regularly scheduled SNF statistical product, the changes in SNFs over time can be monitored.
3.6 What CDE capabilities has this use case demonstrated?
As demonstrated by this use case, the CDE Framework is a powerful process for guiding and curating the development of statistics to address complex purposes and uses. Additionally, use cases help illuminate technical capabilities that should be present in the data enterprise to facilitate and accelerate the reuse of data and methods in the development and dissemination of new statistical products.
This CDE demonstration is the first of many use cases needed to define and develop CDE capabilities. Underlying each use case is the curation process. Curation documents each step, including decisions that may involve tradeoffs. Curation preserves and adds value to the data. This includes organizing to facilitate data discovery and easy access; providing metadata to enable the reuse in scientific and programmatic research; enhancing the value of the data enterprise through linkages between datasets; and mapping the network of interconnections between datasets, research outputs, researchers, and institutions. Over time, a searchable curation system will be needed as a foundation for creating statistical products in the CDE.
The types of products from a use case that can benefit the larger community are limited only by the creativity of the researchers and stakeholders carrying out the use case. The products from this use case are reusable code; integrated data sets across diverse topics for each SNF; maps and other visualizations; statistical products such as SNF deficiency indices and various indices that measure community and SNF resilience; the probability of a worker reaching an SNF in the event of extreme flooding; and a GitHub repo that provides easy access to all these products plus relevant metadata, literature, and government documents and regulations.
Conducting this use case has been an eye-opening experience as to the amount and quality of publicly available data to address our research questions. The statistical capabilities and products flowing from diverse use cases can only be identified as the program progresses.
4 Defining Purposes and Uses to Support the Development of Statistical Products in a 21st Century Census Curated Data Enterprise Environment
Stephanie Shipp, Joseph Salvo, and Vicki Lancaster
University of Virginia
4.1 Summing it up
We end where we began in Section 1. Through this four-part series, we introduced a Curated Data Enterprise (CDE) Framework (see Figure 1) that can guide the development and dissemination of statistics broadly applicable to addressing social and economic issues while ensuring replicability and reusability. The CDE provides the scaffold for scaling the statistical product development of interest to the US Census Bureau and broadly applies to official statistics agencies (Keller et al. 2022). We illustrated this through a use case on climate resiliency of skilled nursing facilities, highlighting the replicability and reusability of the capabilities that would benefit from inclusion in a CDE.
As noted in the first three parts, the process begins with articulating purposes and uses through stakeholder engagement and continues by leveraging that engagement, including subject matter expertise, to inform statistical product development. Eliciting purposes and uses from stakeholders and data users is facilitated by asking questions such as:
What questions keep you awake at night because you don’t have data insights to address them? What are those purposes and uses that you need statistical products to support?
How do we collaborate and engage with you to better understand your needs and help you identify gaps in understanding regarding purpose and use?
How do we prioritize what statistical products to develop first?
Examples of purposes and uses that drive new statistical products include accurately measuring gig employment (Salvo, Shipp, and Zhang 2022a), migration due to extreme climate events (Salvo, Shipp, and Zhang 2022b), the various dimensions of housing affordability (Wu et al. 2023), and addressing the undercount of young children (Salvo, Lancaster, and Shipp 2023). Other topics that require multiple sources and types of data include creating a household living budget based on the minimum necessary to ensure an adequate standard of living (V. Lancaster, Montalvo, et al. 2023) and using this budget as a starting point for measuring insecurity across components such as food or housing (Montalvo et al. 2023).
4.2 Developing an end-to-end (E2E) Curation System
Purposes and uses defined in use cases are important to support the rapid development of statistical products. These use cases will capture the imagination of those working to address today’s critical issues and advance public understanding and trust in federal statistics. The above paragraph provides examples of purposes and uses for which we have developed use cases.
Use cases are a powerful mechanism to promote methodological research to develop and implement capabilities needed in a CDE. The objectives are to undertake research projects that have the potential to create statistical products with explicit purposes and uses that will exercise the end-to-end (E2E) curation components.
When implemented, these proposed use cases will demonstrate a sequence of capabilities needed to build the CDE, such as agile data discovery, reusing modules and data (including synthetic data), tracking the provenance of collected and generated data, reusing synthetic data and methods to integrate many types of data, conducting statistical analysis involving heterogeneous data integration, and reviewing data and statistical results with an equity and ethics lens. These steps will be captured in an E2E Curation System.
- Criteria for developing and evaluating use cases that will uncover the capabilities and research necessary to develop the CDE
Criteria are needed to evaluate use cases and to partner with researchers and stakeholders in developing and implementing the capabilities to capture in the CDE. The choice of use cases, when curated, needs to provide unique insight into CDE capabilities and statistical product development. The capabilities to be developed include addressing some purpose and use that no single source of information can resolve, generating practical diagnostics to improve existing methods, creating pilot software, and validating new and improved statistical products. These criteria, developed through listening sessions and discussions with experts, guide the prioritization and selection of use cases and their evaluation after curation (see Table 2) (Keller et al. 2022).
- Value and feasibility of the CDE approach described in the existing research (potential use case) to address emerging or long-standing issues, i.e., its purpose and use over and above existing approaches to address high-priority problems.
- Stakeholders’ challenges and issues as the source of purposes and uses.
- Subject Matter Experts to advise on the approach and implementation.
- Partners to access data from local and state governments, nonprofit organizations, and the private sector, and strategies to overcome legal and administrative barriers to such access that benefit both the providers and recipients of the data.
- Survey, administrative, opportunity, and procedural data from multiple sources (e.g., local, state, federal, third-party) to address the purpose and use (issue) in an integrated way. There are well-defined data ingestion and governance requirements.
- Computation and measurement requirements for statistical products include the unit(s) of analysis and their characteristics, temporal sequence, geocoded location data, and methods for imputations, projections, and statistical analysis.
- Equity and ethical dimensions are considered at each step to ensure that the Use Case provides fair and accurate representation across groups and an assessment that the potential benefits outweigh the potential harm.
- Evidence of CDE capabilities to be built, including the code, data, and documentation to create the statistical products, which can be described in the curation step.
- Statistical Products include integrated data sources, indicators, maps, visualizations, storytelling, and analysis.
- Potential viability of proposed dissemination platforms for interactive access to data products at all levels of data acumen (Keller and Shipp 2021) while adhering to confidentiality and privacy rules.
- An end-to-end (E2E) curation process
Curation is an E2E process, defined by the context of the purposes and uses, that documents the decisions and tradeoffs at each step in the CDE Framework. The following definition of curation is used because it serves the CDE’s vision.
Curation involves documenting, for each statistical product, the inputs from which the product is derived, the wrangling used to transform the information into the product, and the statistical product itself. Purposes and uses provide the context for each statistic and statistical product.
This definition has evolved from numerous stakeholder discussions via listening sessions and discussions with Census Bureau staff (Nusser et al. forthcoming; Faniel, Frank, and Yakel 2019; NASEM 2022).
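As a rough illustration of the curation definition above, the three documented elements (inputs, wrangling, and the statistical product) plus the purpose-and-use context could be captured in a simple record. This is only a minimal sketch: the class name `CurationRecord`, its fields, and the example values are hypothetical and are not part of the CDE Framework or any Census Bureau system.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CurationRecord:
    """Hypothetical record documenting one statistical product."""

    product: str            # the statistical product itself
    purpose_and_use: str    # context for the statistic or product
    inputs: list[str] = field(default_factory=list)     # sources the product is derived from
    wrangling: list[str] = field(default_factory=list)  # ordered transformation steps

    def add_step(self, description: str) -> None:
        """Append one wrangling decision, preserving the order it was made in."""
        self.wrangling.append(description)

    def to_json(self) -> str:
        """Serialize the record so it can be stored alongside code and data."""
        return json.dumps(asdict(self), indent=2)


# Illustrative values only; they are assumptions, not taken from a curated use case.
record = CurationRecord(
    product="County-level resilience indicators",
    purpose_and_use="Preparedness of skilled nursing facilities for weather events",
    inputs=["survey data", "administrative data"],
)
record.add_step("Geocode facility locations")
record.add_step("Aggregate to county level")
print(record.to_json())
```

Keeping the wrangling steps as an ordered list is one way to preserve the sequence of decisions and tradeoffs; a real curation system would likely need richer metadata than this sketch carries.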
As use cases are curated, the CDE capabilities will evolve to quickly develop statistical products. These curated use cases are integral to developing an E2E curation process for the CDE.
- Invitation to contribute purpose and use ideas for developing new statistical products
CDE development aims to curate a significant number of use cases that address social and economic issues and have the potential to define capabilities to be built in the CDE. Initially, the Census Bureau is seeking ideas for purposes and uses to define these use cases and statistical products.
The skilled nursing facility use case presented in Section 3 included code, data, and documentation to calculate the probability of workers getting to work during a weather event, resilience indicators at the county or sub-county level, alternative skilled nursing home deficiency measures, and other capabilities. Another example is a use case that creates a household living budget (Lancaster et al. 2024).
- Incorporating capabilities in the CDE
To accelerate the development of statistical products, the Census Bureau will develop use cases to articulate and create CDE capabilities. This requires identifying those valuable nuggets for learning and quickly translating and incorporating this information into the CDE. Examples of critical capabilities of interest are learning about the utility of synthetic data, the ability to aggregate data into custom geographies, and combining different units of analysis. The expected outcome is the creation of an innovative 21st Century Census Curated Data Enterprise focused on purposes and uses that overcome the limitations and challenges of today’s survey-alone model.
The development of the 21st Century Census Curated Data Enterprise presents an opportunity for researchers to help drive the CDE as the foundation for creating new statistical products. The US Census Bureau is seeking ideas for purposes and uses that will define new statistical products. The Bureau is interested in research projects (use cases) that are guided by the CDE framework as potential new statistical products. It wants to learn from and understand your experiences in using the CDE framework, for example, what worked well, what challenges you faced, how each step in the framework was curated, and what capabilities are replicable and reusable for developing and enhancing statistical products.
About the authors
Stephanie Shipp leads the Curated Data Enterprise research portfolio and collaborates with the US Census Bureau. She is an economist with experience in data science, survey statistics, public policy, innovation, ethics, and evaluation.
Vicki Lancaster is a statistician with expertise in experimental design, linear models, computation, visualizations, data analysis, and interpretation.
Sallie Keller is the Chief Scientist and Associate Director of Research and Methodology at the US Census Bureau. She is a statistician with research interests in social and decision informatics, the statistical underpinnings of data science, and data access and confidentiality.
Joseph Salvo is a demographer with experience in US Census Bureau statistics and data. He makes presentations on demographic subjects to a wide range of groups about managing major demographic projects involving the analysis of large data sets for local applications.
Henning Mortveit’s research interests include massively interacting systems and the mathematics supporting rigorous analysis and understanding of their stability and resiliency.
Samarth Swarup is a computer scientist with experience in computational social science, resiliency and sustainability, and simulation analytics.
Aaron Schroeder’s experience is in the technologies and related policies of information and data integration and systems analysis, and policy and program development and implementation.
Dawen Xie is interested in Geographic Information Systems (GIS), visual analytics, information management systems, and databases, with a current focus on building different dynamic web systems.
References
Footnotes
https://www.census.gov/newsroom/blogs/director/2023/01/a-look-ahead-2023.html
Nursing staff includes medical aides and technicians, certified nursing assistants, licensed practical nurses (LPNs), LPNs with administrative duties, registered nurses (RNs), RNs with administrative duties, and the RN director of nursing.
For example, distinguishing county from city when the name is the same could be done using State/County FIPS codes. Richmond County is 51159; Richmond City is 51760.
The ZIP code is a system of postal codes used by the United States Postal Service. The name ZIP was chosen to indicate that mail travels more quickly when senders use the postal code.
Average Daily Nursing Staff is the daily number of Medical Aides and Technicians, CNAs, LPNs, LPNs with administrative duties, RNs, RNs with administrative duties, and RN Director of Nursing averaged over three months.